Deep Multimodal Learning for Emotion Recognition in Spoken Language

Authors

  • Yue Gu
  • Shuhong Chen
  • Ivan Marsic
Abstract

In this paper, we present a novel deep multimodal framework to predict human emotions from sentence-level spoken language. Our architecture has two distinctive characteristics. First, it extracts high-level features from both text and audio via a hybrid deep multimodal structure that considers spatial information from text, temporal information from audio, and high-level associations derived from low-level handcrafted features. Second, we fuse all features with a three-layer deep neural network to learn correlations across modalities, and we train the feature extraction and fusion modules together, allowing global fine-tuning of the entire structure. We evaluated the proposed framework on the IEMOCAP dataset. The results show promising performance, achieving 60.4% weighted accuracy across five emotion categories.
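As a concrete illustration, the sketch below shows a minimal PyTorch rendering of the pipeline the abstract describes: a convolutional text branch (spatial patterns over word embeddings), a recurrent audio branch over low-level handcrafted frame features (temporal dynamics), and a three-layer DNN that fuses both. All names (e.g., MultimodalEmotionNet), layer sizes, and feature dimensions are illustrative assumptions, not the authors' exact configuration.

    import torch
    import torch.nn as nn

    class MultimodalEmotionNet(nn.Module):
        """Hypothetical two-branch model: text CNN + audio LSTM + 3-layer fusion DNN."""
        def __init__(self, vocab_size, embed_dim=128, audio_feat_dim=34,
                     hidden_dim=128, num_classes=5):
            super().__init__()
            # Text branch: embeddings + 1D convolution capture spatial
            # (n-gram) patterns across the sentence.
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.text_conv = nn.Conv1d(embed_dim, hidden_dim, kernel_size=3, padding=1)
            # Audio branch: an LSTM models the temporal dynamics of
            # low-level handcrafted frame features (e.g., MFCC/prosody vectors).
            self.audio_lstm = nn.LSTM(audio_feat_dim, hidden_dim, batch_first=True)
            # Fusion: a three-layer DNN learns cross-modal correlations.
            self.fusion = nn.Sequential(
                nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, num_classes),
            )

        def forward(self, tokens, audio_frames):
            t = self.embed(tokens).transpose(1, 2)               # (B, E, T_text)
            t = torch.relu(self.text_conv(t)).max(dim=2).values  # max-pool over time
            _, (h, _) = self.audio_lstm(audio_frames)            # (B, T_audio, F) in
            a = h[-1]                                            # final hidden state
            return self.fusion(torch.cat([t, a], dim=1))         # emotion logits

    # A single optimizer over all parameters trains the extraction and
    # fusion modules together, matching the joint fine-tuning described above.
    model = MultimodalEmotionNet(vocab_size=10000)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

Training both branches through the shared fusion loss is what lets gradients from the emotion objective reshape the unimodal extractors, rather than freezing them after separate pre-training.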

Similar Papers

The multimodal nature of spoken word processing in the visual world: Testing the predictions of alternative models of multimodal integration

Ambiguity in natural language is ubiquitous (Piantadosi, Tily & Gibson, 2012), yet spoken communication is effective due to integration of information carried in the speech signal with information available in the surrounding multimodal landscape. However, current cognitive models of spoken word recognition and comprehension are underspecified with respect to when and how multimodal information...

Emotion Recognition Using Multimodal Deep Learning

To enhance the performance of affective models and reduce the cost of acquiring physiological signals for real-world applications, we adopt a multimodal deep learning approach to construct affective models on the SEED and DEAP datasets to recognize different kinds of emotions. We demonstrate that high-level representation features extracted by the Bimodal Deep AutoEncoder (BDAE) are effective for e...
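For readers unfamiliar with the BDAE idea, here is a minimal sketch (assuming PyTorch) of a bimodal autoencoder in that spirit: two modality-specific encoders merge into a shared code, and decoders reconstruct both inputs from it. The input dimensions (e.g., 310-d EEG features, 33-d eye-movement features) and layer sizes are assumptions for illustration, not the paper's exact setup.

    import torch
    import torch.nn as nn

    class BimodalAutoEncoder(nn.Module):
        """Hypothetical BDAE-style model: fuse two modalities into a shared code."""
        def __init__(self, eeg_dim=310, eye_dim=33, shared_dim=64):
            super().__init__()
            self.enc_eeg = nn.Sequential(nn.Linear(eeg_dim, 128), nn.ReLU())
            self.enc_eye = nn.Sequential(nn.Linear(eye_dim, 128), nn.ReLU())
            self.shared = nn.Linear(256, shared_dim)  # fused high-level features
            self.dec_eeg = nn.Linear(shared_dim, eeg_dim)
            self.dec_eye = nn.Linear(shared_dim, eye_dim)

        def forward(self, eeg, eye):
            z = self.shared(torch.cat([self.enc_eeg(eeg), self.enc_eye(eye)], dim=1))
            return z, self.dec_eeg(z), self.dec_eye(z)

    # Train on reconstruction of both modalities; the shared code z then
    # serves as the high-level feature for a downstream emotion classifier.
    model = BimodalAutoEncoder()
    eeg, eye = torch.randn(8, 310), torch.randn(8, 33)
    z, eeg_hat, eye_hat = model(eeg, eye)
    loss = nn.functional.mse_loss(eeg_hat, eeg) + nn.functional.mse_loss(eye_hat, eye)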

Music Emotion Recognition via End-to-End Multimodal Neural Networks

Music emotion recognition (MER) is a key issue in user context-aware recommendation. Many existing methods require hand-crafted features on audio and lyrics. Here we propose a new end-to-end method for recognizing emotions of tracks from their acoustic signals and lyrics via multimodal deep neural networks. We evaluate our method on about 7,000 K-pop tracks labeled as positive or negative emotion....

Multimodal Emotion Recognition Using Multimodal Deep Learning

To enhance the performance of affective models and reduce the cost of acquiring physiological signals for real-world applications, we adopt a multimodal deep learning approach to construct affective models from multiple physiological signals. For the unimodal enhancement task, we show that the best recognition accuracy of 82.11% on the SEED dataset is achieved with shared representations generated by...

Modeling affected user behavior during human-machine interaction

Spoken human-machine interaction supported by state-of-the-art dialog systems is becoming a standard technology, and much effort has been invested in this kind of artificial communication interface. Still, spoken dialog systems (SDS) are unable to offer users a natural way of communicating. Most existing automated dialog systems are based on a questionnaire-based st...

Journal:
  • CoRR

Volume: abs/1802.08332  Issue: -

Pages: -

Publication date: 2018